The report considers maximum-likelihood (ML) and minimum-cross-entropy (MCE)
classification  of  samples  from  an  unknown  probability  density  when  the  hypotheses  comprise  an 
exponential  family.  It  is  shown  that  ML  and  MCE  lead  to  the  same  classification  rule,  and  the 
result  is  illustrated  in  terms  of  a  method  for  estimating  covariance  matrices  recently  developed 
by  Burg,  Luenberger,  and  Wenger.  MCE  classification  applies  to  the  general  case  in  which  it 
cannot  be  assumed  that  the  samples  were  generated  by  one  of  the  hypothesis  densities.  The 
common  use  of  ML  in  this  case  is  technically  incorrect,  but  the  equivalence  of  MCE  and  ML 
provides  a  theoretical  justification. 


ON  A  RELATION  BETWEEN  MAXIMUM-LIKELIHOOD  CLASSIFICATION 
AND  MINIMUM-CROSS-ENTROPY  CLASSIFICATION 


INTRODUCTION 

Maximum  likelihood  (ML)  and  related  classification  methods  are  often  used  to  choose  from  a  set 
of  hypotheses  based  on  known  data.  The  theoretical  justification  for  these  methods  depends  on  the 
assumption  that  one  of  the  hypotheses  is  true,  but  they  are  used  even  when  it  is  known  that  this 
assumption  is  false.  This  practice  can  be  justified  on  the  practical  grounds  that  it  works,  but  there  is  no 
compelling  theoretical  justification.  In  minimum-cross-entropy  (MCE)  classification,  one  classifies  data 
in  terms  of  estimated  underlying  probability  densities  using  a  nearest-neighbor  rule  and  an 
information-theoretic distortion measure [1]. Speech coding by vector quantization [2,3] can be
derived as a special case of MCE classification [1].

In  this  report  I  consider  the  relation  between  ML  classification  and  MCE  classification  of  samples 
from  an  unknown  probability  density  when  the  hypotheses  comprise  an  exponential  family.  I  show  that 
ML  and  MCE  lead  to  the  same  classification  rule,  but  that  MCE  applies  in  the  general  case  when  one 
cannot  assume  that  one  of  the  hypotheses  is  true  and  thereby  provides  a  theoretical  foundation  for  the 
technically  incorrect  use  of  ML.  I  illustrate  the  results  in  terms  of  a  recently  developed  method  of 
estimating covariance matrices [4].

STATEMENT  OF  THE  CLASSIFICATION  PROBLEM 

Let $\{q_s(\mathbf{x}) : s \in A\}$ be a finite or infinite set of probability densities on some vector space. Let $q^\dagger(\mathbf{x})$
be the probability density for vector-valued samples from some unknown process, and let
$X = \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M$ be a sequence of M vector-valued samples from $q^\dagger$. Let $\{H_s : s \in A\}$ be the set of
mutually exclusive hypotheses

$H_s$ = X is a sequence of independent samples from $q_s$. (1)

The problem is to classify X by choosing one of the densities $q_s$. There are really two problems here,
depending on whether or not one can assume a priori that one of the $H_s$ is true. If so, then our problem
is to find s such that $q^\dagger(\mathbf{x}) = q_s(\mathbf{x})$. If not, then the problem is to find s such that $q_s(\mathbf{x})$ is "closest
to" $q^\dagger(\mathbf{x})$ in some well-defined, acceptable sense. Most of the time, the latter case applies: one cannot
assume that $q^\dagger = q_s$ for any s. Speech-processing applications are good examples: speech is dealt with
in terms of Gaussian models even though it is well known that speech is not Gaussian. We restrict
consideration to classification densities $q_s$ that comprise an exponential family,

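$$q_s(\mathbf{x}) = p(\mathbf{x}) \exp\left[-\lambda^{(s)} - \sum_k \beta_k^{(s)} f_k(\mathbf{x})\right], \tag{2}$$

where $p(\mathbf{x})$ and $f_k(\mathbf{x})$ are fixed functions and $\lambda^{(s)}$ and $\beta_k^{(s)}$ are constants. A set of Gaussian densities
is one example of such an exponential family. We place no restrictions on the unknown process $q^\dagger$.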

Exponential families can always be expressed as the result of a minimum cross-entropy problem
[5-7]. In particular, the $q_s$ satisfy

$$H(q_s, p) = \min_{q'} H(q', p), \tag{3}$$


Manuscript  approved  February  14,  1983. 



JOHN  E.  SHORE 


where H is the cross-entropy (discrimination information, directed divergence, I-divergence,
Kullback-Leibler number, etc.),

$$H(q, p) = \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \, d\mathbf{x}, \tag{4}$$


and  where  q'  varies  over  the  set  of  densities  that  satisfy  the  constraints 


$$\int q'(\mathbf{x}) f_k(\mathbf{x}) \, d\mathbf{x} = F_k^{(s)} \tag{5}$$


for known numbers $F_k^{(s)}$. In the solution (2) the constants $\beta_k^{(s)}$ and $\lambda^{(s)}$ are Lagrangian multipliers
chosen to satisfy the constraints (5) and

$$\int q_s(\mathbf{x}) \, d\mathbf{x} = 1.$$

In the notation of Refs. 7 and 8, one can express (3) as $q_s = p \circ I_s$, where $I_s$ represents the information
given by the constraints (5). The density p is called the prior, and the densities $q_s$ are called posteriors.


REVIEW  OF  THE  TWO  CLASSIFICATION  METHODS 

Maximum-Likelihood  Classification 

In the maximum-likelihood (ML) approach one classifies X by

$$\max_s p(X \mid H_s), \tag{6}$$

where $p(X \mid H_s)$ is the probability that X is the result of M independent samples from $q_s(\mathbf{x})$. Bayes's
law yields

$$p(H_s \mid X) = p(X \mid H_s) \frac{p(H_s)}{p(X)},$$

so that ML classification is equivalent to maximum-a-posteriori (MAP) classification,

$$\max_s p(H_s \mid X),$$

when the hypotheses $H_s$ have equal prior probabilities. ML classification is used in a variety of applications,
even when clearly one cannot assume a priori that one of the hypotheses is true. This practice
can be justified on practical grounds (it works), but it has not been justified on compelling theoretical
grounds.
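As a minimal sketch of rule (6), assume (purely for illustration) that the hypotheses are zero-mean one-dimensional Gaussians that differ only in variance. The function names and the sample values below are hypothetical and do not come from the report. The log-likelihood is maximized instead of the likelihood, which leaves the chosen s unchanged.

```python
import math


def gaussian_log_density(x, variance):
    """Log of a zero-mean 1-D Gaussian density with the given variance."""
    return -0.5 * math.log(2.0 * math.pi * variance) - x * x / (2.0 * variance)


def ml_classify(samples, variances):
    """Return the index s that maximizes log p(X | H_s) = sum_i log q_s(x_i)."""
    log_likelihoods = [
        sum(gaussian_log_density(x, v) for x in samples) for v in variances
    ]
    return max(range(len(variances)), key=lambda s: log_likelihoods[s])


# Illustrative data and hypotheses (assumed, not taken from the report).
samples = [0.3, -1.2, 2.1, -0.4, 1.7]
variances = [0.5, 1.5, 4.0]
best = ml_classify(samples, variances)
print("ML choice: variance", variances[best])
```

Note that nothing in this computation checks whether the samples really came from one of the hypothesized densities; the rule simply returns the best-scoring hypothesis.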

Minimum-Cross-Entropy  Classification 

Minimum-cross-entropy (MCE) classification of information from the unknown process $q^\dagger$
proceeds from knowledge of the expectations

$$\int q^\dagger(\mathbf{x}) f_k(\mathbf{x}) \, d\mathbf{x} = F_k, \tag{7}$$

that is, expectations of the same constraint functions $f_k(\mathbf{x})$ as in (2). The quantity $F = F_1, \ldots, F_n$ is
called a feature vector; its elements are the data to be classified. Let I represent the constraints (7), and
let the density p in (2) be considered as a prior estimate of $q^\dagger$. Then a method of classifying F using
MCE consists of the following two-step procedure [1]:


1. Compute $q = p \circ I$, the minimum-cross-entropy estimate of $q^\dagger$ based on the information
(7).


NRL  REPORT  8707 


2. Choose one of the classification densities by the MCE rule

$$\min_{s \in A} H(q, q_s). \tag{8}$$


In Ref. 1 it is shown that

$$H(q^\dagger, q_s) = H(q^\dagger, q) + H(q, q_s) \tag{9}$$

holds. Now, the MCE estimate $q = p \circ I$ minimizes the term $H(q^\dagger, q)$ in the following sense: Of all
densities having the general form (2), q is the closest possible density to $q^\dagger$. (This property is known
as expectation-value matching [7].) Since the second term on the right-hand side of (9) is minimized
by (8), it follows that MCE classification is optimal in the sense of minimizing the total distortion
$H(q^\dagger, q_s)$. An alternative MCE method of classifying F is to use the rule

$$\min_{s \in A} H(q_s \circ I, q_s). \tag{10}$$

In words, each of the classification densities $q_s$ is in turn considered as a prior estimate of $q^\dagger$; when the
information F is taken into account, the resulting posterior estimate of $q^\dagger$ is $q_s \circ I$. The rule (10)
chooses the classification density $q_s$ that, when considered as a prior estimate of $q^\dagger$, is changed the least
by taking F into account.

Both of the MCE rules (8) and (10) have compelling intuitive and information-theoretic
justifications. Fortunately one does not have to choose between them. Because the constraints (5) and
(7) involve the same constraint functions $f_k(\mathbf{x})$, it follows [7, Property 14] that

$$q_s \circ I = (p \circ I_s) \circ I = p \circ I = q \tag{11}$$

holds, which in turn means that (8) and (10) are equivalent.


Computationally, it turns out that one need not compute $q = p \circ I = q_s \circ I$, as the rules (8) and
(10) are equivalent to

$$\min_{s \in A} \left[\lambda^{(s)} + \sum_{k=1}^{n} \beta_k^{(s)} F_k\right], \tag{12}$$

where the $\lambda^{(s)}$ and $\beta_k^{(s)}$ are the Lagrangian multipliers from the classification densities (2) [1].
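The step from (8) to (12) can be sketched as follows. The sketch assumes that the estimate $q = p \circ I$ has the exponential form (2) with its own multipliers, written here as $\lambda$ and $\beta_k$ without superscripts (symbols introduced only for this illustration), and that q reproduces the expectations $F_k$ of (7) and is normalized. The full argument is in Ref. 1.

```latex
\begin{align*}
H(q, q_s)
  &= \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{q_s(\mathbf{x})} \, d\mathbf{x} \\
  &= \int q(\mathbf{x}) \Bigl[ \lambda^{(s)} - \lambda
       + \sum_k \bigl( \beta_k^{(s)} - \beta_k \bigr) f_k(\mathbf{x}) \Bigr] d\mathbf{x} \\
  &= \Bigl[ \lambda^{(s)} + \sum_k \beta_k^{(s)} F_k \Bigr]
     - \Bigl[ \lambda + \sum_k \beta_k F_k \Bigr].
\end{align*}
```

The second bracket does not depend on s, so minimizing $H(q, q_s)$ over s amounts to minimizing the first bracket, which is the quantity in rule (12).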


For the application being considered here, the expectations $F_k$ are estimated from X by

$$F_k = \frac{1}{M} \sum_{i=1}^{M} f_k(\mathbf{x}_i). \tag{13}$$
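The estimate (13) and the rule (12) can be sketched in a few lines. The example assumes, for illustration only, a single constraint function f(x) = x^2, a standard-normal prior p, and zero-mean Gaussian classification densities with variance v, for which the multipliers in (2) work out to lambda = (1/2) log v and beta = (1/2)(1/v - 1). All function names and data values are hypothetical.

```python
import math


def feature_vector(samples, constraint_functions):
    """Estimate F_k = (1/M) * sum_i f_k(x_i), as in (13)."""
    M = len(samples)
    return [sum(f(x) for x in samples) / M for f in constraint_functions]


def mce_classify(features, multipliers):
    """Return the index s minimizing lambda^(s) + sum_k beta_k^(s) * F_k, as in (12).

    `multipliers` is a list of (lambda, [beta_1, ..., beta_n]) pairs.
    """
    def score(s):
        lam, betas = multipliers[s]
        return lam + sum(b * F for b, F in zip(betas, features))

    return min(range(len(multipliers)), key=score)


def gaussian_multipliers(variance):
    """Multipliers of N(0, variance) relative to the assumed prior N(0, 1)."""
    return 0.5 * math.log(variance), [0.5 * (1.0 / variance - 1.0)]


# Illustrative data and hypotheses (assumed, not taken from the report).
samples = [0.3, -1.2, 2.1, -0.4, 1.7]
variances = [0.5, 1.5, 4.0]
features = feature_vector(samples, [lambda x: x * x])
best = mce_classify(features, [gaussian_multipliers(v) for v in variances])
print("MCE choice: variance", variances[best])
```

Only the feature vector and the stored multipliers enter the decision; the estimate q itself is never constructed.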


COMPARISON  OF  THE  CLASSIFICATION  METHODS 


I begin the comparison by computing the consequences of the ML rule (6) given the form (2) for
the classification densities. One has

$$p(X \mid H_s) = \prod_{i=1}^{M} q_s(\mathbf{x}_i) = \exp\left[-M\lambda^{(s)} - \sum_{i=1}^{M} \sum_{k=1}^{n} \beta_k^{(s)} f_k(\mathbf{x}_i)\right] \prod_{i=1}^{M} p(\mathbf{x}_i), \tag{14}$$

bearing in mind that this is valid only if one knows that X came from one of the $q_s(\mathbf{x})$.


The ML rule (6) is equivalent to the rule

$$\min_s \{-\log p(X \mid H_s)\}.$$

Substitution of (14) yields

$$\min_s \left[M\lambda^{(s)} + \sum_{i=1}^{M} \sum_{k=1}^{n} \beta_k^{(s)} f_k(\mathbf{x}_i) - \sum_{i=1}^{M} \log p(\mathbf{x}_i)\right]. \tag{15}$$

The last sum in (15) involves terms independent of s and can therefore be dropped. Also, dividing by
the constant M has no effect. Hence (15) is equivalent to

$$\min_s \left[\lambda^{(s)} + \sum_{k=1}^{n} \beta_k^{(s)} \frac{1}{M} \sum_{i=1}^{M} f_k(\mathbf{x}_i)\right].$$

Substitution of (13) yields

$$\min_s \left[\lambda^{(s)} + \sum_{k=1}^{n} \beta_k^{(s)} F_k\right],$$

which is the same as the MCE rule (12).
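The reduction from (15) to (12) can be checked numerically. The sketch below again assumes zero-mean one-dimensional Gaussian hypotheses with a standard-normal prior (an illustrative choice, not the report's) and confirms that the per-sample negative log-likelihood and the MCE score differ by a quantity that is the same for every s, namely the dropped term -(1/M) sum_i log p(x_i).

```python
import math


def neg_log_likelihood_per_sample(samples, variance):
    """-(1/M) * log p(X | H_s) for a zero-mean Gaussian with this variance."""
    total = sum(
        0.5 * math.log(2.0 * math.pi * variance) + x * x / (2.0 * variance)
        for x in samples
    )
    return total / len(samples)


def mce_score(samples, variance):
    """lambda + beta * F from (12), with f(x) = x^2 and assumed prior N(0, 1)."""
    F = sum(x * x for x in samples) / len(samples)
    lam = 0.5 * math.log(variance)
    beta = 0.5 * (1.0 / variance - 1.0)
    return lam + beta * F


# Illustrative data and hypotheses (assumed, not taken from the report).
samples = [0.3, -1.2, 2.1, -0.4, 1.7]
variances = [0.5, 1.5, 4.0]

# The gap between the two scores should not depend on the hypothesis.
gaps = [
    neg_log_likelihood_per_sample(samples, v) - mce_score(samples, v)
    for v in variances
]
# The dropped term of (15), divided by M: -(1/M) * sum_i log p(x_i).
dropped = sum(0.5 * math.log(2.0 * math.pi) + x * x / 2.0 for x in samples) / len(samples)
print(gaps, dropped)
```

Because the gap is constant in s, the two rules necessarily select the same hypothesis.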

I have just shown that ML classification is equivalent to MCE classification when one can assume
that X comes from one of the classification densities $q_s$. This fact was shown previously by Kupperman
[9] and Kullback [5], although the derivation here is carried out more directly and in terms of the
computational MCE classification rule (12) that was derived in Ref. 1. Recently, Csiszar and Tusnady have
considered the connection between ML and MCE when X results from a mapping of samples from one
of the $q_s$ [10].

What about the case when one cannot assume that X comes from one of the classification densities
$q_s$? In this case it is common to use the ML rule (6) anyway, without good theoretical justification.
But the case is covered by MCE classification, because rule (12) was derived in Ref. 1 without
assuming that the feature vector F is the same as any of the $F^{(s)}$ that determine the classification densities
by (3), (4), and (5) or that estimates of F are obtained by sampling one of the $q_s$. It was
assumed only that the goal is to find the $F^{(s)}$ that "best resembles" F and that the MCE criterion (8) is
reasonable. When X cannot be assumed to come from one of the $q_s$, it turns out that those who apply
ML anyway are doing MCE classification.
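To illustrate the case in which no hypothesis is true, the sketch below draws samples from a uniform density, which is not a member of the Gaussian family used for classification, and applies both rules. The uniform source, the seed, and the candidate variances are assumptions made only for this example; the hypotheses are zero-mean Gaussians with a standard-normal prior as in the earlier illustrations.

```python
import math
import random


def ml_index(samples, variances):
    """Index maximizing the Gaussian log-likelihood, i.e. rule (6)."""
    def log_likelihood(v):
        return sum(
            -0.5 * math.log(2.0 * math.pi * v) - x * x / (2.0 * v) for x in samples
        )
    return max(range(len(variances)), key=lambda s: log_likelihood(variances[s]))


def mce_index(samples, variances):
    """Index minimizing lambda + beta * F, i.e. rule (12), with f(x) = x^2."""
    F = sum(x * x for x in samples) / len(samples)

    def score(v):
        return 0.5 * math.log(v) + 0.5 * (1.0 / v - 1.0) * F

    return min(range(len(variances)), key=lambda s: score(variances[s]))


# Samples from a non-Gaussian source (assumed for illustration).
rng = random.Random(0)
samples = [rng.uniform(-2.0, 2.0) for _ in range(200)]
variances = [0.5, 1.5, 4.0]

ml_choice = ml_index(samples, variances)
mce_choice = mce_index(samples, variances)
print(ml_choice, mce_choice)
```

The two indices agree even though the likelihood computation rests on an assumption that is false here; under the MCE reading, the selected Gaussian is simply the candidate closest to the data in the cross-entropy sense.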

DISCUSSION 

MCE classification provides a general method for taking a sequence of independent vector-valued
samples $\mathbf{x}_i$ from an unknown process $q^\dagger$ and classifying that sequence by identifying a member of a set
of exponential-class densities $\{q_s(\mathbf{x}) : s \in A\}$. The classification rule (12) combines the results of a
two-step procedure: The first step obtains from X a minimum-cross-entropy estimate q of $q^\dagger$. The second
step identifies the density $q_s$ that is closest to q in the cross-entropy sense. With the assumption that
the $\mathbf{x}_i$ come from one of the $q_s$, MCE classification reduces to ML classification. Without this assumption
MCE classification applies anyway and thereby provides a theoretical justification for the technically
incorrect use of ML.

Furthermore, the $q_s$ may themselves be approximations if the constraints $F^{(s)}$ in (5) are approximations
based on training data in the same sense as (13). That is, the $q_s$ may be approximations based
on samples from "true densities" $q_s^\dagger$. Then, even if one can assume that the classification-data vector X
comes from one of the $q_s^\dagger$, one cannot assume that X comes from one of the classification densities $q_s$;
again, ML cannot be applied in principle.

AN EXAMPLE: ESTIMATION OF STRUCTURED COVARIANCE MATRICES

Recently, Burg, Luenberger, and Wenger [4] have generalized the popular Burg technique [11] for
estimating the autocorrelation function of a random process from time-domain samples. The new
method estimates a covariance matrix of specified structure from vector-valued samples of a random
process. Written in terms of the notation here, Burg et al. consider the set of classification densities

$$q_s(\mathbf{x}) = (2\pi)^{-N/2} |\mathbf{R}_s|^{-1/2} \exp\left(-\tfrac{1}{2} \mathbf{x}^t \cdot \mathbf{R}_s^{-1} \cdot \mathbf{x}\right). \tag{16}$$

This is Eq. (1) in Ref. 4. The superscript t indicates a transpose, the raised dot (·) indicates a vector or
matrix product, and $\{\mathbf{R}_s : s \in A\}$ is a finite or infinite set of feasible covariance matrices. Given a data
vector X consisting of M vector-valued samples from an unknown density $q^\dagger(\mathbf{x})$, the sample covariance
matrix $\mathbf{R}$ is defined as

$$\mathbf{R} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{x}_i \mathbf{x}_i^t. \tag{17}$$


Burg et al. assume that X came from one of the $q_s$ (that is, $q^\dagger = q_s$ holds for some s), and they classify
X by the ML rule (6). The result is the classification rule

$$\max_s \left[-\log |\mathbf{R}_s| - \mathrm{Tr}(\mathbf{R}_s^{-1} \cdot \mathbf{R})\right], \tag{18}$$

where Tr indicates a trace operation. This is Eq. (4) in Ref. 4, except that R and S are replaced
respectively by $\mathbf{R}_s$ and $\mathbf{R}$.
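A small sketch of the sample covariance (17) and the rule (18), restricted to 2-by-2 matrices so that the determinant and inverse can be written out by hand. The helper names, the data, and the candidate matrices are hypothetical; this is not the structured-covariance algorithm of Ref. 4, only the scoring rule applied to a finite candidate list.

```python
import math


def sample_covariance(samples):
    """R = (1/M) * sum_i x_i x_i^t for 2-D samples, as in (17)."""
    M = len(samples)
    R = [[0.0, 0.0], [0.0, 0.0]]
    for x in samples:
        for a in range(2):
            for b in range(2):
                R[a][b] += x[a] * x[b] / M
    return R


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def trace_of_product(a, b):
    """Tr(a . b) for 2x2 matrices."""
    return sum(a[i][j] * b[j][i] for i in range(2) for j in range(2))


def rule_18_score(R_s, R):
    """-log|R_s| - Tr(R_s^{-1} . R), the quantity maximized in (18)."""
    return -math.log(det2(R_s)) - trace_of_product(inv2(R_s), R)


def classify_covariance(samples, candidates):
    """Return the index of the candidate covariance that maximizes (18)."""
    R = sample_covariance(samples)
    return max(range(len(candidates)), key=lambda s: rule_18_score(candidates[s], R))


# Illustrative samples and feasible matrices (assumed, not from Ref. 4).
samples = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
candidates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.0], [0.0, 2.0]],
    [[2.0, 0.0], [0.0, 0.5]],
]
best = classify_covariance(samples, candidates)
print("chosen candidate:", candidates[best])
```

In this example the sample covariance coincides with the second candidate, which the rule selects.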


Since (16) belongs to the class of generalized exponentials, the results of the section beginning on
page 2 apply: (18) must be equivalent to MCE classification, and (18) must also apply in the more
realistic case where one cannot assume that X comes from one of the $q_s$. For completeness, one can
demonstrate the connection explicitly by showing that (18) is a special case of the MCE rule (12).


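One needs to express (16) as minimum-cross-entropy posteriors $q_s = p \circ I_s$. That is, one needs
to express (16) in the form (2). As a prior, one can use

$$p(\mathbf{x}) = (2\pi)^{-N/2} \exp\left(-\tfrac{1}{2} \mathbf{x}^t \cdot \mathbf{I} \cdot \mathbf{x}\right), \tag{19}$$

where $\mathbf{I}$ is the identity matrix. Using (19), one rewrites (16) as

$$q_s(\mathbf{x}) = p(\mathbf{x}) \, |\mathbf{R}_s|^{-1/2} \exp\left(-\tfrac{1}{2} \mathbf{x}^t \cdot (\mathbf{R}_s^{-1} - \mathbf{I}) \cdot \mathbf{x}\right). \tag{20}$$

Defining

$$\lambda^{(s)} = -\log |\mathbf{R}_s|^{-1/2} \tag{21}$$

and

$$\beta_{ij}^{(s)} = \tfrac{1}{2} \{\mathbf{R}_s^{-1} - \mathbf{I}\}_{ij} \tag{22}$$

permits one to rewrite (20) as

$$q_s(\mathbf{x}) = p(\mathbf{x}) \exp\left[-\lambda^{(s)} - \sum_{i,j} \beta_{ij}^{(s)} x_i x_j\right], \tag{23}$$

which is just the desired form (2). The constraint functions in this case are $f_{ij}(\mathbf{x}) = x_i x_j$. The
expectations (5) are just the covariances

$$\int q_s(\mathbf{x}) \, x_i x_j \, d\mathbf{x} = \{\mathbf{R}_s\}_{ij}.$$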


Given the data vector X, elements $\{\mathbf{R}\}_{ij}$ of the sample covariance matrix (17) are just estimates
of the expectations $\int d\mathbf{x} \, q^\dagger(\mathbf{x}) \, x_i x_j$. Hence, using (17), (21), and (22), one can write the MCE
classification rule (12) as

$$\min_s \left[\tfrac{1}{2} \log |\mathbf{R}_s| + \tfrac{1}{2} \sum_{i,j} \{\mathbf{R}_s^{-1} - \mathbf{I}\}_{ij} \{\mathbf{R}\}_{ij}\right]. \tag{24}$$


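The term involving the identity matrix $\mathbf{I}$ does not depend on s. It follows that (24) is equivalent to

$$\min_s \left[\log |\mathbf{R}_s| + \sum_{i,j} \{\mathbf{R}_s^{-1}\}_{ij} \{\mathbf{R}\}_{ij}\right], \tag{25}$$

where the factor 1/2 has also been dropped. Since $\mathbf{R}$ is symmetric, as can be seen from (17), then

$$\sum_{i,j} \{\mathbf{R}_s^{-1}\}_{ij} \{\mathbf{R}\}_{ij} = \mathrm{Tr}(\mathbf{R}_s^{-1} \cdot \mathbf{R}).$$

Eq. (25) then becomes

$$\min_s \left[\log |\mathbf{R}_s| + \mathrm{Tr}(\mathbf{R}_s^{-1} \cdot \mathbf{R})\right],$$

which is equivalent to (18).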
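The chain from (24) to the trace form can be checked numerically for 2-by-2 matrices. The sketch below uses hypothetical symmetric matrices and helper names; it verifies that the quantity in (24) equals minus one half of the quantity in (18) minus Tr(R)/2, an offset that does not depend on s, so the minimizer of (24) is the maximizer of (18).

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def rule_18_score(R_s, R):
    """-log|R_s| - Tr(R_s^{-1} . R), the quantity maximized in (18)."""
    inv = inv2(R_s)
    trace = sum(inv[i][j] * R[j][i] for i in range(2) for j in range(2))
    return -math.log(det2(R_s)) - trace


def rule_24_value(R_s, R):
    """(1/2) log|R_s| + (1/2) sum_ij {R_s^{-1} - I}_ij {R}_ij, as in (24)."""
    inv = inv2(R_s)
    total = 0.0
    for i in range(2):
        for j in range(2):
            identity = 1.0 if i == j else 0.0
            total += (inv[i][j] - identity) * R[i][j]
    return 0.5 * math.log(det2(R_s)) + 0.5 * total


# Illustrative symmetric sample covariance and candidates (assumed values).
R = [[2.0, 0.3], [0.3, 1.0]]
candidates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.5], [0.5, 1.0]],
    [[3.0, -1.0], [-1.0, 2.0]],
]

offsets = [rule_24_value(c, R) + 0.5 * rule_18_score(c, R) for c in candidates]
half_trace = 0.5 * (R[0][0] + R[1][1])
argmin_24 = min(range(3), key=lambda s: rule_24_value(candidates[s], R))
argmax_18 = max(range(3), key=lambda s: rule_18_score(candidates[s], R))
print(offsets, argmin_24, argmax_18)
```

Every offset equals -Tr(R)/2, which confirms that the identity-matrix term and the factor 1/2 can be dropped without changing the selected matrix.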